This notebook complements the "Business Understanding, Data Understanding, Project Planning" file in the project repository. There is also a separate R file which answers further questions.
The notebook is split into several parts. The project questions analysed here are the following:
1) For which time periods are customers looking for rental cars now/yesterday/last week/last month?
2) Are there unexpected changes (peaks, drops) in bookings or price requests?
3) How many price requests could be cached (have same pickup and return location, source country and driver age) for 1 hour, 3 hours, 12 hours?
4) What is the actual demand after eliminating duplicated requests?
This notebook mainly uses Pandas, with Plotly for interactive graphs and Seaborn for non-interactive ones. In some graphs cufflinks was used for easier implementation. Note that we learned new things during the creation of this notebook, so many parts could be written in a simpler manner.
Rate quote data. In this notebook we use the first file we were given: around 23 GB of rate quotes from 01.12.2017 to 03.12.2017.
Reservations data. We make use of all the reservations data we were given.
The columns and values in the data set are described in the "Business Understanding, Data Understanding, Project Planning" file.
A file with information to map contract_id's to brokers.
We saw that the data itself was quite clean, at least in the columns required for this project. In total there was more than 80 GB of request data and 32,000 KB of reservations data.
import pandas as pd
import csv
%matplotlib inline
%pylab inline
pylab.rcParams['figure.figsize'] = (14, 6)
%load_ext autoreload
%autoreload 2
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot
from plotly import graph_objs as go
init_notebook_mode()
fields = ["timestamp", "pickup_timestamp", "return_timestamp", "broker_contract_id", "driver_age",
"request_uuid", "source_country_region_id", "pickup_desk_id", "return_desk_id", "rental_days"]
df = pd.read_csv("rate_quote.csv", skipinitialspace=True, usecols=fields)
print(df.shape)
We see that the first file had close to 3 million rows.
Converting the timestamp columns to the datetime datatype. Pandas does this neatly for us.
df["timestamp"] = pd.to_datetime(df['timestamp'])
df["pickup_timestamp"] = pd.to_datetime(df['pickup_timestamp'])
df["return_timestamp"] = pd.to_datetime(df['return_timestamp'])
Merging rate quote data with info about brokers. We add Broker_name column to our data with this.
# Merge with broker contract name
fields = ["Contract_ID", "Contract_name", "Broker name"]
contract_df = pd.read_csv("broker_contracts.csv", skipinitialspace=True, usecols=fields)
print(contract_df.shape)
print(contract_df.dtypes)
contract_df.columns = ["broker_contract_id", "Contract_name", "Broker_name"]
merged_requests = pd.merge(df, contract_df, on=["broker_contract_id"])
print(merged_requests.dtypes)
One of the questions of the project was to see how many of the requests are duplicates and how the demand changes after removing them.
When exploring general demand we do not care about car specific columns and therefore we ignore them.
In general there were two ways to detect duplicates:
1) According to the rate quote data's "request_uuid" column. When a broker asks for prices it makes several requests, each for several cars at a time. Requests from the same contract share the request_uuid field.
2) Since a broker can have several different contracts, a single price query by a broker can produce requests with different request_uuid's which should still count as one "demand".
The first case is easy to handle: when exploring demand we want to remove all rows with the same request_uuid, so we drop duplicates on request_uuid and keep the first row of each. Since we do not care about car-specific columns we can simply ignore them; pickup_time, return_time, source country et cetera will still be the same for all of them.
To handle the case with different contracts, we look for requests which are made close together in time. If two requests are close together and share all other values except request_uuid, they were probably made as part of the same query. Depending on how close in time we require the requests to be, there will be more or fewer false positives.
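The idea can be illustrated on a tiny synthetic example (the timestamps below are made up, not from the project data): requests closer together than the threshold, identical in all other fields, collapse into one demand.

```python
import pandas as pd

MAX_TIME_DIFFERENCE = 200000  # microseconds, as used later in the notebook

# Three synthetic requests with identical parameters but different request_uuid's.
timestamps = pd.to_datetime([
    "2017-12-01 12:00:00.000000",
    "2017-12-01 12:00:00.050000",  # 50 ms after the first -> same demand
    "2017-12-01 12:00:01.000000",  # ~1 s after the second -> separate demand
])

kept = []
previous = None
for ts in sorted(timestamps):
    # Keep the row only if it is far enough from the previous one.
    if previous is None or (ts - previous) > pd.Timedelta(microseconds=MAX_TIME_DIFFERENCE):
        kept.append(ts)
    previous = ts
print(len(kept))  # 2
```

With a larger threshold more requests would be merged, at the price of more false positives.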
Group the data by request_uuid, find each group's minimal timestamp and use it as the aggregate row's timestamp. We can use drop_duplicates, which keeps the first row of each group (assuming the data is in chronological order).
# Keep the first row so that the timestamp we pick is the earliest one.
## Two ways to do it. The second one works even if the data is not in chronological order.
first_from_each_requestuuid = merged_requests.drop_duplicates("request_uuid")
#first_from_each_requestuuid = merged_requests[merged_requests.groupby('request_uuid')['timestamp'].rank() == 1 ]
Before: 2997586 After: 130915 Decrease of 95.63265240763735%
We see that there are A LOT of duplicates based on request_uuid alone. From this we can say that brokers on average request data on about 23 cars per query (1/(1 - 0.9563) ≈ 23).
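The back-of-the-envelope arithmetic behind the "about 23 cars per query" figure, using the counts printed below:

```python
before_count = 2997586  # rows before dropping duplicates
after_count = 130915    # rows after dropping duplicates by request_uuid

decrease = 1 - after_count / before_count        # share of rows that were duplicates
avg_cars_per_request = before_count / after_count  # equivalently 1 / (1 - decrease)
print(round(decrease * 100, 2), round(avg_cars_per_request, 1))  # 95.63 22.9
```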
before_count = len(merged_requests)
after_count = len(first_from_each_requestuuid)
print("Before:", before_count, "After:", after_count)
print("Decrease of {}%".format((100-(after_count/before_count)*100)))
## Saving data frame (without duplicates by requestuuid)
first_from_each_requestuuid.to_csv("rate_quote_1_dup2.csv")
Different request_uuid, but other parameters the same (nearby timestamps)
1) Group by broker name
2) Group by all other columns which should remain the same: ["pickup_timestamp", "return_timestamp", "driver_age", "source_country_region_id", "pickup_desk_id", "return_desk_id", "rental_days"]
3) Check for orders within a set time difference; if all parameters except contract_id / request_uuid are the same, merge them into one request.
Currently we use MAX_TIME_DIFFERENCE = 200000 microseconds (0.2 seconds).
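One pitfall worth noting when working in microseconds: on a pandas timedelta, the `.microseconds` attribute holds only the sub-second component, so two timestamps more than a second apart can still report a small `.microseconds` value. Comparing against a full `pd.Timedelta` avoids this (the timestamps below are illustrative):

```python
import pandas as pd

a = pd.Timestamp("2017-12-01 12:00:00.000000")
b = pd.Timestamp("2017-12-01 12:00:02.100000")  # 2.1 seconds later

diff = b - a
print(diff.microseconds)                          # 100000 -- only the sub-second part
print(diff > pd.Timedelta(microseconds=200000))   # True -- the full difference
```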
## Group by broker_name
grouped_by_broker = first_from_each_requestuuid.groupby(first_from_each_requestuuid["Broker_name"])
MAX_TIME_DIFFERENCE = 200000 # How many microseconds
filtered_by_broker = {}
# For every order find all orders within MAX_TIME_DIFFERENCE of it.
duplicate_count = 0
for group in grouped_by_broker:
    # Group by all other columns now.
    all_other = group[1].groupby(["pickup_timestamp", "return_timestamp", "driver_age",
                                  "source_country_region_id", "pickup_desk_id", "return_desk_id", "rental_days"])
    filtered_rows = []
    for group_other in all_other:
        same_parameters = group_other[1].sort_values(by="timestamp")
        previous_row = None
        for row in same_parameters.itertuples():
            if previous_row:
                # Keep the row only if it is far enough (in time) from the previous one.
                # Compare the full timedelta; .microseconds alone wraps around for gaps over one second.
                if (row[1] - previous_row[1]) > pd.Timedelta(microseconds=MAX_TIME_DIFFERENCE):
                    filtered_rows.append(row)
                else:
                    duplicate_count += 1
            else:
                # Always keep the first row of a group.
                filtered_rows.append(row)
            previous_row = row
    filtered_by_broker[group[0]] = filtered_rows
print("Removed {} duplicates".format(duplicate_count))
print("Before removing duplicates in this phase")
for group in grouped_by_broker:
    print(group[0], len(group[1]))
print("\nAfter removing duplicates")
for k, v in filtered_by_broker.items():
    print(k, len(v))
filtered_df = pd.DataFrame([item for sub in filtered_by_broker.keys() for item in filtered_by_broker[sub]])
print("Before removing any duplicates:", before_count, "After:", len(filtered_df))
print("Decrease of {}%".format((100-(len(filtered_df)/before_count)*100)))
filtered_df.to_csv("new_rate_quote_1.csv")
After just removing duplicates by request_uuid we got a decrease of ~95%. After additionally merging rows with nearby timestamps we got a total decrease of 97.3%. The last step may, however, produce false positives or negatives, and the result changes with MAX_TIME_DIFFERENCE.
After filtering (removing duplicates and unnecessary columns), the 23 GB price quote file was reduced to a 10,000 KB file.
# requests per datetime
df_requests = pd.DataFrame(df["timestamp"])
requests_per_day = df_requests.groupby(df_requests["timestamp"].dt.date).count()
print(requests_per_day)
requests_per_day.plot.bar()
filtered_per_contract = filtered_df.groupby(filtered_df["Broker_name"])
reqs_separately = merged_requests[["timestamp", "Broker_name"]]
requests_per_contract_ = reqs_separately.groupby(reqs_separately["Broker_name"])
requests_per_contractor = requests_per_contract_.count().plot(kind="bar", title="Request count by contractor")
requests_per_contractor.set_ylabel("Request count")
requests_per_contractor = reqs_separately.groupby(reqs_separately["Broker_name"]).count().plot(kind="bar", title="Request count by contractor in log scale")
requests_per_contractor.set_yscale("log", nonposy='clip')
requests_per_contractor.set_ylabel("Request count")
contractors = []
for group in requests_per_contract_:
    time_contract = pd.DataFrame(group[1]["timestamp"])
    contractors.append((group[0], time_contract.groupby(time_contract["timestamp"].dt.date).count()))
contractors_2 = []
for group in filtered_per_contract:
    time_contract = pd.DataFrame(group[1]["timestamp"])
    contractors_2.append((group[0], time_contract.groupby(time_contract["timestamp"].dt.date).count()))
    # Use "ax" for the axes so the name does not shadow matplotlib.pyplot (imported as plt later).
    ax = time_contract.groupby(time_contract["timestamp"].dt.date).count().plot(
        kind="bar", title="Requests per day for contractor on filtered data " + str(group[0])
    )
    ax.set_ylabel("Request count")
    ax.set_xlabel("Request timestamp")
l_new = []
for con in contractors_2:
    for row in con[1].itertuples():
        l_new.append([con[0], row[0], row[1]])
df_contractor_dreq_new = pd.DataFrame(l_new)
df_contractor_dreq_new.columns = ["contractor", "date", "count"]
l = []
for con in contractors:
    for row in con[1].itertuples():
        l.append([con[0], row[0], row[1]])
df_contractor_dreq = pd.DataFrame(l)
df_contractor_dreq.columns = ["contractor", "date", "count"]
a = df_contractor_dreq.groupby("contractor").sum()
b = df_contractor_dreq_new.groupby("contractor").sum()
result = pd.concat([a,b], axis=1)
result.columns = ["before", "after"]
result["contractor"] = result.index
print(result)
melted=pd.melt(result, id_vars=["contractor"])
import seaborn as sns
import matplotlib.pyplot as plt
ax = sns.barplot(x="contractor", y='value', hue="variable", data=melted, log=True)
ax.set(xlabel='Broker', ylabel='Number of requests', title="Number of requests before and after removing duplicates (log scale)")
plt.ylim(1, 10e6)
# For each contractor see how many requests per day.
contractors_dups = []
for group in df_dups.groupby(df_dups["Broker_name"]):
    time_contract = pd.DataFrame(group[1]["timestamp"])
    contractors_dups.append((group[0], time_contract.groupby(time_contract["timestamp"].dt.date).count()))
l_dups = []
for con in contractors_dups:
    for row in con[1].itertuples():
        l_dups.append([con[0], row[0], row[1]])
df_con_dups = pd.DataFrame(l_dups)
df_con_dups.columns = ["contractor", "date", "count"]
import seaborn as sns
import matplotlib.pyplot as plt
plt.ylim(10, 40)
sns.barplot(x='contractor', y='count', hue='date', data=df_con_dups, log=True)
plt.ylim(1, 10e6)
On data without duplicates.
Here we can also see an unexpected peak around February 7th, which extends over several days.
# group data by day
daylist = []
days = []
for group in df.groupby(df["timestamp"].dt.date):
    days.append(group[0])
    daylist.append(group[1])
for i, day in enumerate(daylist):
    day_pickup = pd.DataFrame(day["pickup_timestamp"])
    counts = day_pickup.groupby(day_pickup["pickup_timestamp"].dt.date).count()
    data = [go.Scatter(x=counts.index, y=counts["pickup_timestamp"])]
    layout = go.Layout(
        title='Number of requests per pickup date for ' + str(days[i]),
        yaxis=dict(
            title='Count of requests'
        )
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
requests = df_requests.reset_index().set_index("timestamp")
resampled_requests = requests.resample("10Min").count()
title = "Number of requests in time in period from 1st December to end of 3rd December (interval 10 min)"
data = [go.Scatter(x=resampled_requests.index, y=resampled_requests["index"])]
layout = go.Layout(
    title=title,
    yaxis=dict(
        title='Count of requests'
    )
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
for i, day in enumerate(daylist):
    reqs = pd.DataFrame(day["timestamp"]).reset_index().set_index("timestamp").resample("5Min").count()
    data = [go.Scatter(x=reqs.index, y=reqs["index"])]
    layout = go.Layout(
        title="Number of requests on " + str(days[i]) + " (interval 5 min)",
        yaxis=dict(
            title='Count of requests'
        )
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
For each request we have a pickup date and a return date. Assume a request has pickup date 27.11.2017 and return date 30.11.2017; such a request adds one to each of the days 27, 28, 29 and 30 (every day between and including pickup and return).
With this we can get some insight into which dates will probably have the largest number of active reservations. From here we can also spot unexpected peaks in the data: dates which draw far more interest than their neighbours.
For example, in the following graphs we can see that 3rd January is unexpectedly popular, perhaps due to some conference around that time.
To compute this we solve the classic problem of finding the maximum number of overlapping intervals. Source: https://www.geeksforgeeks.org/find-the-point-where-maximum-intervals-overlap/
First we sort both pickup and return times by date. Then we iterate over all days between the first pickup date and the last return date.
For each day we count how many pickups and how many returns fall on it: each pickup opens one more active reservation during that day and each return closes one.
To make this considerably faster it can be run on the data without duplicates, since the distributions (by broker) should remain the same.
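As a sanity check, the sweep idea described above can be traced on a tiny synthetic example (three made-up rental intervals, not project data):

```python
import pandas as pd

# Three synthetic (pickup, return) intervals.
intervals = [("2017-12-01", "2017-12-03"),
             ("2017-12-02", "2017-12-04"),
             ("2017-12-03", "2017-12-03")]
pickups = pd.Series(pd.to_datetime([p for p, _ in intervals])).sort_values().reset_index(drop=True)
returns = pd.Series(pd.to_datetime([r for _, r in intervals])).sort_values().reset_index(drop=True)

counts = {}
open_count, p_i, r_i = 0, 0, 0
for date in pd.date_range(pickups.iloc[0], returns.iloc[-1]):
    # Close reservations that ended before this day.
    while r_i < len(returns) and returns.iloc[r_i] < date:
        open_count -= 1
        r_i += 1
    # Open reservations whose pickup is on or before this day.
    while p_i < len(pickups) and pickups.iloc[p_i] <= date:
        open_count += 1
        p_i += 1
    counts[date] = open_count

print(list(counts.values()))  # [1, 2, 3, 1]
```

December 3rd is covered by all three intervals, so the peak of 3 lands there, matching what the full algorithm below computes on the real data.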
daylist = []
days = []
for group in filtered_df.groupby(filtered_df["timestamp"].dt.date):
    days.append(group[0])
    daylist.append(group[1])
def count_active_on_day(df):
    """Find how many requests are active on each day.

    Accepts a dataframe with pickup dates and return dates
    (columns pickup_timestamp and return_timestamp).
    Efficient version based on this:
    https://www.geeksforgeeks.org/find-the-point-where-maximum-intervals-overlap/
    Returns a dictionary with dates as keys and the number of requests
    spanning that date as values.
    """
    pickup_sort = df["pickup_timestamp"].dt.normalize().sort_values().reset_index(drop=True)
    return_sort = df["return_timestamp"].dt.normalize().sort_values().reset_index(drop=True)
    start = pickup_sort[0]
    end = return_sort[len(df) - 1]
    pickup_ind = 0  # current index in the sorted pickup times
    return_ind = 0  # current index in the sorted return times
    opened_bracket_count = 0
    counts_new = {}
    length = len(df)
    # For every date between the first pickup date and the last return date,
    # find how many reservations would be open based on the price requests.
    for date in pd.date_range(start, end, normalize=True):
        # A reservation ended before this day: close it.
        while opened_bracket_count >= 1 and return_sort[return_ind] < date:
            opened_bracket_count -= 1
            return_ind += 1
        # A reservation starts on or before this day: open it.
        while pickup_ind < length and pickup_sort[pickup_ind] <= date:
            opened_bracket_count += 1
            pickup_ind += 1
        counts_new[date] = opened_bracket_count
    return counts_new
from collections import defaultdict
# Do the same, now using faster algorithm.
day_counts = []
for day in daylist:
    day_counts.append(count_active_on_day(day))
for i in range(3):
    time_series_1 = pd.DataFrame(pd.Series(day_counts[i]))
    time_series_1.index.name = "timestamp"
    time_series_1.columns = ["count"]
    title = "How many times a day is between pickup and return date (on filtered data) " + str(days[i])
    df1 = time_series_1.groupby(pd.Grouper(freq="D")).sum()
    data = [go.Scatter(x=df1.index, y=df1["count"])]
    layout = go.Layout(
        title=title,
        yaxis=dict(
            title='Count of occurrences'
        )
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
Let's separate them by broker, but combine the days.
We can see more unexpected peaks: there seems to be unexpected interest in early January and on February 16.
print(len(filtered_df))
print(len(df_dups))
# For each contractor find hot dates.
pickups_per_broker = filtered_df[["pickup_timestamp", "return_timestamp", "Broker_name"]]
broker_dict = {}
for group in pickups_per_broker.groupby(pickups_per_broker["Broker_name"]):
    broker_dict[group[0]] = count_active_on_day(group[1])
for k, v in broker_dict.items():
    time_series_1 = pd.DataFrame(pd.Series(v))
    time_series_1.columns = ["count"]
    df1 = time_series_1.groupby(pd.Grouper(freq="D")).sum()
    data = [go.Scatter(x=df1.index, y=df1["count"])]
    layout = go.Layout(
        title="Hottest days for " + k + " by day (on request data)",
        yaxis=dict(
            title='Count of occurrences of the day between all pickup and return times'
        )
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
# For each contractor find hot dates.
import cufflinks as cf
cf.go_offline()
fields = ["timestamp", "pickup_timestamp", "return_timestamp", "Broker_name"]
df_dups = filtered_df
df_dups["timestamp"] = pd.to_datetime(df_dups['timestamp'])
df_dups["pickup_timestamp"] = pd.to_datetime(df_dups['pickup_timestamp'])
df_dups["return_timestamp"] = pd.to_datetime(df_dups['return_timestamp'])
broker_dict_dups = defaultdict(dict)
for group in df_dups.groupby(df_dups["Broker_name"]):
    for g in group[1].set_index("timestamp").groupby(pd.Grouper(freq="D")):
        broker_dict_dups[group[0]][g[0]] = count_active_on_day(g[1])
brokers = df_dups["Broker_name"].unique()
for broker in brokers:
    k = broker
    v = broker_dict_dups[k]
    data = pd.DataFrame(v)
    data.iplot(kind='scatter', filename='cufflinks/cf-simple-line', title=k + " requests hot days on filtered data", yTitle="Count of occurrences", xTitle="Date")
fields = ["rate_request_timestamp", "reservation_request_timestamp", "pickup_timestamp", "return_timestamp"]
df = pd.read_csv("reservations.csv", skipinitialspace=True, usecols=fields)
print(df.shape)
df["rate_request_timestamp"] = pd.to_datetime(df['rate_request_timestamp'])
df["reservation_request_timestamp"] = pd.to_datetime(df['reservation_request_timestamp'])
df["pickup_timestamp"] = pd.to_datetime(df['pickup_timestamp'])
df["return_timestamp"] = pd.to_datetime(df['return_timestamp'])
print(df.dtypes)
To see how many reservation requests were made on each date:
df_reservation_requests = df["reservation_request_timestamp"]
reservation_requests = df_reservation_requests.reset_index().set_index("reservation_request_timestamp")
resampled_requests = reservation_requests.resample("1D").count()
data = [go.Scatter(x=resampled_requests.index, y=resampled_requests["index"])]
title="Number of reservation requests in time"
layout = go.Layout(
    title=title,
    yaxis=dict(
        title='Count of occurrences'
    )
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
Note that the data from December onward lies in the future, while the data before that is in the past. We'll look at this later.
df_pickup_times = df["pickup_timestamp"]
pickup_times = df_pickup_times.reset_index().set_index("pickup_timestamp")
resampled_requests = pickup_times.resample("1D").count()
data = [go.Scatter(x=resampled_requests.index, y=resampled_requests["index"])]
title="Number of pickup times on a date (daily) in reservation data"
layout = go.Layout(
    title=title,
    yaxis=dict(
        title='Count of occurrences'
    )
)
fig = go.Figure(data=data, layout=layout)
iplot(fig)
All in the future. From here we can see, for example, that February 18 is unexpectedly popular.
fields = ["rate_request_timestamp", "reservation_request_timestamp", "pickup_timestamp",
"return_timestamp", "Broker name", "reservation_status_type"]
df_res = pd.read_csv("reservations.csv", skipinitialspace=True, usecols=fields)
print(df_res.shape)
df_res = df_res[df_res.reservation_status_type == "CONFIRMED"]
df_res["rate_request_timestamp"] = pd.to_datetime(df_res['rate_request_timestamp'])
df_res["reservation_request_timestamp"] = pd.to_datetime(df_res['reservation_request_timestamp'])
df_res["pickup_timestamp"] = pd.to_datetime(df_res['pickup_timestamp'])
df_res["return_timestamp"] = pd.to_datetime(df_res['return_timestamp'])
print(df_res.dtypes)
# For each contractor find hot dates.
broker_dict_res = {}
for group in df_res.groupby(df_res["Broker name"]):
    broker_dict_res[group[0]] = count_active_on_day(group[1])
for k, v in broker_dict_res.items():
    time_series_1 = pd.DataFrame(pd.Series(v))
    time_series_1.index.name = "timestamp"
    time_series_1.columns = ["count"]
    df1 = time_series_1.groupby(pd.Grouper(freq="D")).sum()
    title = "Hottest days for " + k + " by day (on CONFIRMED reservation data)"
    data = [go.Scatter(x=df1.index, y=df1["count"])]
    layout = go.Layout(
        title=title,
        yaxis=dict(
            title='Count of occurrences'
        )
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
fields = ["rate_request_timestamp", "reservation_request_timestamp", "pickup_timestamp",
"return_timestamp", "Broker name", "reservation_status_type"]
df_res = pd.read_csv("reservations.csv", skipinitialspace=True, usecols=fields)
print(df_res.shape)
df_res = df_res[df_res.reservation_status_type == "COMPLETED"]
df_res["rate_request_timestamp"] = pd.to_datetime(df_res['rate_request_timestamp'])
df_res["reservation_request_timestamp"] = pd.to_datetime(df_res['reservation_request_timestamp'])
df_res["pickup_timestamp"] = pd.to_datetime(df_res['pickup_timestamp'])
df_res["return_timestamp"] = pd.to_datetime(df_res['return_timestamp'])
print(df_res.dtypes)
# For each contractor find hot dates.
broker_dict_res = {}
for group in df_res.groupby(df_res["Broker name"]):
    broker_dict_res[group[0]] = count_active_on_day(group[1])
for k, v in broker_dict_res.items():
    time_series_1 = pd.DataFrame(pd.Series(v))
    time_series_1.index.name = "timestamp"
    time_series_1.columns = ["count"]
    df1 = time_series_1.groupby(pd.Grouper(freq="D")).sum()
    title = "Hottest days for " + k + " by day (on COMPLETED reservation data)"
    data = [go.Scatter(x=df1.index, y=df1["count"])]
    layout = go.Layout(
        title=title,
        yaxis=dict(
            title='Count of occurrences'
        )
    )
    fig = go.Figure(data=data, layout=layout)
    iplot(fig)
What percentage of requests are made for the next day? next 3 days? next 7 days? next 14 days? next 30 days? next 60 days? next 90 days? next 180 days? next 360 days?
fields = ["timestamp", "pickup_timestamp", "return_timestamp", "Broker_name"]
df_dups = pd.read_csv("rateq_dup.csv", skipinitialspace=True, usecols=fields)
print(df_dups.shape)
df_dups["timestamp"] = pd.to_datetime(df_dups['timestamp'])
df_dups["pickup_timestamp"] = pd.to_datetime(df_dups['pickup_timestamp'])
df_dups["return_timestamp"] = pd.to_datetime(df_dups['return_timestamp'])
print(df_dups.dtypes)
daylist = []
days = []
for group in df_dups.groupby(df_dups["timestamp"].dt.date):
    days.append(group[0])
    daylist.append(group[1])
cont_dict = defaultdict(dict)
day_offsets = [0, 1, 3, 7, 14, 30, 60, 90, 180]
for daynr, day in enumerate(daylist):
    broker_dict_dups = {}
    # Group each day's slice by its own Broker_name column.
    for group in day.groupby(day["Broker_name"]):
        broker_dict_dups[group[0]] = count_active_on_day(group[1])
    for k, v in broker_dict_dups.items():
        time_series_1 = pd.DataFrame(pd.Series(v))
        time_series_1.index.name = "timestamp"
        time_series_1.columns = ["count"]
        summ = time_series_1.sum()
        day_counts = []
        for off in day_offsets:
            day_counts.append(time_series_1.loc[time_series_1.index.min():time_series_1.index.min() + pd.DateOffset(off)].sum())
        proportions = [float(d / summ) for d in day_counts]
        df_prop = pd.DataFrame(
            {"first": day_offsets,
             "proportion": proportions})
        cont_dict[k][daynr] = df_prop
for broker in brokers:
    dc = pd.DataFrame([cont_dict[broker][i]["proportion"] for i in cont_dict[broker].keys()]).reset_index().T
    dc = dc.drop(dc.index[0])
    dc = dc.infer_objects()
    if not dc.empty:
        dc.columns = [str(days[i]) for i in cont_dict[broker].keys()]
        dc["days"] = [0, 1, 3, 7, 14, 30, 60, 90, 180]
        layout = go.Layout(
            title="Broker " + broker + " proportions of requests of first days",
            titlefont=dict(
                family='Courier New, monospace'
            ),
            xaxis=dict(
                showgrid=True,
                zeroline=True,
                showline=True,
                showticklabels=True,
                title="Number of days ahead",
                range=[0, 185]
            ),
            yaxis=dict(
                showgrid=True,
                zeroline=True,
                showline=True,
                showticklabels=True,
                range=[0, 1],
                title="Proportion of requests"
            )
        )
        data = [go.Scatter(x=dc["days"], y=dc[day], name=day) for day in dc.columns[:-1]]
        fig = go.Figure(data=data, layout=layout)
        iplot(fig)
We can see that on most days about 25% of requests ask for cars within the upcoming 30 days, and half of the requests fall within the first 90 days. Note that the slope of the curve decreases as we move forward: there is more interest in near-term days than in later ones.
How many price requests could be cached (have same pickup and return location, source country and driver age) for 1 hour, 3 hours, 12 hours? Per contractor.
We will use the initial data for these tasks.
From the request data we check how many responses are exactly the same. To do this we count the duplicates in the "charge_log" column, which contains the response to a request. This gives a rough idea of the scale of the problem.
fields = ["charge_log"]
df_cache = pd.read_csv("rate_quote.csv", skipinitialspace=True, usecols=fields)
# Drop duplicates
dropped = df_cache.drop_duplicates("charge_log")
print("Before removing any duplicates:", len(df_cache), "After:", len(dropped))
print("Decrease of {}%".format((100-(len(dropped)/len(df_cache))*100)))
We see that 99.1% of responses are identical. This should of course be taken with a grain of salt, as there can be many coincidences in pricing between cars of different models et cetera. We can see that there are 26226 distinct responses in the first data file.
As discussed with the client, the following factors go into creating a price response (assuming that the market prices of cars have not changed in the meantime):
pickup time,
rental duration,
pickup and return location,
source country,
driver age.
We looked at three different scenarios (for each contract separately and then merged by broker):
First, how many requests have the same source, pickup, return and driver age.
Secondly, what would happen if return_timestamp and pickup_timestamp also had to match exactly.
Thirdly, matching on pickup day, driver age, pickup location, drop-off location and rental length in days.
In this notebook we only show the last one; the others are in a separate caching notebook.
For this we wrote the function cached_by_fields.
In practice, for each cache interval (60 minutes, 180 minutes, 720 minutes), it checks every 30 minutes how many of the requests made in the last 30 minutes could have been served from the cache according to those criteria.
Note that for faster calculation we assume the cache is updated every 30 minutes. This is not a totally accurate representation, because there could be further duplicate requests within the last 30 minutes.
This means that if we are looking at requests between 12:30-13:00, the cache holds the requests from (12:30 minus 60/180/720 minutes) until 12:30.
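The cache check itself boils down to an outer merge with an indicator column: rows of the current window that appear only on the left side were not in the cache. A minimal sketch on synthetic rows (all values made up):

```python
import pandas as pd

# Hypothetical "current" requests (last 30 min) and cached requests (previous window).
current_requests = pd.DataFrame({
    "pickup_desk_id": [1, 2, 3],
    "driver_age": [30, 45, 30],
})
cached_requests = pd.DataFrame({
    "pickup_desk_id": [1, 9],
    "driver_age": [30, 45],
})

# Rows present only on the left side were not in the cache.
merged = pd.merge(current_requests, cached_requests, how="outer", indicator=True)
not_in_cache = merged[merged["_merge"] == "left_only"]

total = len(current_requests)
cached = total - len(not_in_cache)
print(cached, total)  # 1 3 -> one of the three current requests was cacheable
```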
# Use data without duplicates
fields = ["timestamp", "pickup_timestamp", "return_timestamp", "broker_contract_id",
"source_country_region_id", "driver_age", "pickup_desk_id", "return_desk_id", "rental_days",
"Broker_name"]
df_dups = pd.read_csv("new_rate_quote_1.csv", skipinitialspace=True, usecols=fields)
print(df_dups.shape)
df_dups["timestamp"] = pd.to_datetime(df_dups['timestamp'])
df_dups["pickup_timestamp"] = pd.to_datetime(df_dups['pickup_timestamp'])
df_dups["return_timestamp"] = pd.to_datetime(df_dups['return_timestamp'])
from collections import defaultdict
def cached_by_fields(dup_fields, merged_requests):
    """
    dup_fields: list of fields which have to be the same for a request to count as "cached".
    merged_requests: dataframe to work on.
    Returns merged_datas, a list of dictionaries, one per cache interval (60 min, 180 min, 720 min).
    Each dictionary maps a contractor to a dictionary keyed by date.
    Note that there are many places to optimize here for faster performance.
    """
    ## Set the index to be the timestamp.
    df_new_dups_2 = merged_requests.set_index("timestamp")
    ## Keep only the interesting columns.
    df_new_dups_2 = df_new_dups_2[dup_fields]
    ## Normalize pickup_timestamps to the day.
    df_new_dups_2["pickup_timestamp"] = df_new_dups_2["pickup_timestamp"].dt.normalize()
    ## Find which brokers are in the data.
    brokers = merged_requests["Broker_name"].unique()
    merged_datas = []
    # For 60-minute, 180-minute and 720-minute caches.
    for frq in [60, 180, 720]:
        merged_data_3 = defaultdict(dict)
        ## For each broker/contractor in the data:
        for contractor in brokers:
            ## Select the rows for this broker.
            df_contractor = df_new_dups_2[df_new_dups_2["Broker_name"] == contractor]
            if not df_contractor.empty:
                ### Sort the contractor's data by time.
                df_contractor = df_contractor.sort_index()
                date_range = pd.date_range(df_contractor.index.min(), df_contractor.index.max(), freq="30Min")
                for i in range(len(date_range) - 1):
                    ## Rows between these two dates are the ones in the cache.
                    cached_requests = df_contractor[date_range[i] - pd.Timedelta(minutes=frq):date_range[i]]
                    ## The current requests are the ones from the last 30 minutes.
                    current_requests = df_contractor[date_range[i + 1] - pd.Timedelta(minutes=30):date_range[i + 1]]
                    ## How many of the current requests are not in the cache:
                    not_in_cache = pd.merge(current_requests,
                                            cached_requests, how='outer', indicator=True)
                    not_in_cache = not_in_cache[not_in_cache["_merge"] == "left_only"]
                    ## Number of requests made in the last 30 minutes.
                    total = len(current_requests)
                    ## Number of non-duplicates between the two dates.
                    non_dup_count = len(not_in_cache)
                    cached = total - non_dup_count
                    ## What is the proportion?
                    proportion = 0 if total == 0 else cached / total
                    ## Store the number cached, the total requests and the proportion.
                    merged_data_3[contractor][date_range[i + 1]] = [cached, total, proportion]
        merged_datas.append(merged_data_3)
    return merged_datas
dup_fields = ["driver_age", "source_country_region_id", "broker_contract_id",
"pickup_desk_id", "return_desk_id", "pickup_timestamp", "rental_days", "Broker_name"]
merged_datas = cached_by_fields(dup_fields, merged_requests)
The first set of graphs shows absolute numbers, the second shows proportions.
for broker in merged_requests["Broker_name"].unique():
    tJigs = []
    for merged_data in merged_datas:
        tJig = pd.DataFrame(merged_data[broker]).T
        if tJig.empty:
            continue
        tJigs.append(tJig)
    layout = go.Layout(
        title=broker + " absolute number of requests cached in 30 minute intervals with different cache lengths",
        titlefont=dict(
            family='Courier New, monospace'
        ),
        xaxis=dict(
            showgrid=True,
            zeroline=True,
            showline=True,
            showticklabels=True,
            title="Date",
        ),
        yaxis=dict(
            showgrid=True,
            zeroline=True,
            showline=True,
            showticklabels=True,
            title="Absolute count of cacheable requests"
        )
    )
    data = []
    try:
        data = [go.Scatter(x=tJigs[0].index, y=tJigs[0][0], name="1 hour"),
                go.Scatter(x=tJigs[1].index, y=tJigs[1][0], name="3 hours"),
                go.Scatter(x=tJigs[2].index, y=tJigs[2][0], name="12 hours")]
    except IndexError:
        # Fewer than three cache intervals had data for this broker.
        pass
    if data:
        fig = go.Figure(data=data, layout=layout)
        iplot(fig)
for broker in merged_requests["Broker_name"].unique():
    tJigs = []
    for merged_data in merged_datas:
        tJig = pd.DataFrame(merged_data[broker]).T
        if tJig.empty:
            continue
        tJigs.append(tJig)
    layout = go.Layout(
        title=broker + " proportion of requests cached in 30 minute intervals with different cache lengths",
        titlefont=dict(
            family='Courier New, monospace'
        ),
        xaxis=dict(
            showgrid=True,
            zeroline=True,
            showline=True,
            showticklabels=True,
            title="Date",
        ),
        yaxis=dict(
            showgrid=True,
            zeroline=True,
            showline=True,
            showticklabels=True,
            title="Proportion of cacheable requests"
        )
    )
    data = []
    try:
        data = [go.Scatter(x=tJigs[0].index, y=tJigs[0][2], name="1 hour"),
                go.Scatter(x=tJigs[1].index, y=tJigs[1][2], name="3 hours"),
                go.Scatter(x=tJigs[2].index, y=tJigs[2][2], name="12 hours")]
    except IndexError:
        # Fewer than three cache intervals had data for this broker.
        pass
    if data:
        fig = go.Figure(data=data, layout=layout)
        iplot(fig)
First of all, we can see that for brokers with low activity a timed cache might not be a good idea, as the proportion of cached requests is almost 0. For TravelJigsaw, on the other hand, the proportion of cached requests can reach around 50% in some periods.
Since the number of requests depends on the time of day, using only time as the caching criterion might not be a good idea; one could also set a lower limit on the cache row count.
Note also that this is an approximation of what would have happened in reality, since the requests in the cache come from the preceding timeframe. E.g. if we are looking at requests from 12:30 to 13:00, then for the 60-minute case the cache holds requests from 11:30 to 12:30.